A stochastic model incorporates randomness; given the same inputs, a stochastic model may generate different outputs on different runs.
This is useful when the process you want to model is fundamentally stochastic, or if there is heterogeneity in its dynamics.
Unfortunately stochastic models can get complicated, and come with a lot of notation and formalism. Our goal here is to quickly review basic probability theory so that we can use stochastic models productively.
A random variable is a variable whose value is random. In this course, random variables will usually have some real-world meaning, like:

* \(X\) is the value of a fair coin toss
* \(X\) is the number of new HIV infections diagnosed this week
* \(X\) is the number of people who had a drug overdose in the last 24 hours
* \(X\) is the survival time of a subject living with cancer
Probability \(\Pr(\cdot)\) is a function whose argument is a value a random variable can take on, and which returns a number between 0 and 1. The collection of these probabilities over all possible values is the probability distribution.
For discrete random variables, we will write the probability mass function \(\Pr(X=x)\). For continuous random variables, we will use the density, defined below.
We usually describe random variables in terms of their probability distribution. For example, consider the random variable \(X\) defined by \(\Pr(X=0) = 1/2\) and \(\Pr(X=1)=1/2\). What might \(X\) refer to?
All random variables on the real line have a cumulative distribution function (CDF) \[ F(x) = \Pr(X \le x) \] The cumulative distribution function fully characterizes the probability distribution of \(X\).
The density of a continuous random variable is \(f(x)\), the derivative of the CDF \(F(x)\) of \(X\), where it exists: \[ f(x) = \lim_{h\to 0} \frac{F(x+h) - F(x)}{h} \]
\(\Pr(Y=y|X=x)\) is the probability that \(Y=y\), given that \(X=x\). We interpret the statement on the right-hand side of the conditioning sign \(|\) as given. The definition of conditional probability tells us how to compute such probabilities:
\[ \Pr(Y=y|X=x) = \frac{\Pr(Y=y,X=x)}{\Pr(X=x)} \]
Example: \(X\sim\text{Binomial}(n,p)\) and \(0< k \le n\).
\[ \Pr(X=k|X>0) = \frac{\Pr(X=k)}{\Pr(X>0)} = \frac{\binom{n}{k} p^k (1-p)^{n-k}}{1-(1-p)^n} \] since \(\Pr(X=k, X>0) = \Pr(X=k)\) when \(k>0\), and \(\Pr(X>0) = 1-(1-p)^n\).
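We can check this formula numerically in R (the values of \(n\), \(p\), and \(k\) here are arbitrary illustrative choices):

```r
# Conditional probability Pr(X = k | X > 0) for X ~ Binomial(n, p)
n <- 10; p <- 0.3; k <- 2                      # arbitrary illustrative values
# built-in pmf divided by Pr(X > 0)
direct <- dbinom(k, size = n, prob = p) / (1 - (1 - p)^n)
# the formula above, written out
by_hand <- choose(n, k) * p^k * (1 - p)^(n - k) / (1 - (1 - p)^n)
all.equal(direct, by_hand)  # TRUE
```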
The expected value of a random variable is its “average” value.
\[ E[X] = \sum_x x \Pr(X=x) \]
or
\[ E[X] = \int_{-\infty}^\infty x dP(x) \]
That is, \(E[X]\) is the average of the values \(x\) that \(X\) can take on, weighted by their probabilities.
Can a random variable have expectation equal to a value that X can never take? Yes. In the example above \(\Pr(X=x)=1/2\) for \(x=1\) and \(x=0\). But \(E[X]=1/2\).
Conditional expectations are just expectations with respect to a conditional probability distribution. \[ E[Y|X=x] = \sum_y y\Pr(Y=y|X=x) \] \[ E[Y|X=x] = \int_{-\infty}^\infty y f(y|X=x) dy \]
Law of total probability: \[ \Pr(X=x) = \sum_y \Pr(X=x|Y=y) \Pr(Y=y) \] \[ \Pr(X=x) = \int \Pr(X=x|Y=y) dP(y) \]
Law of total expectation
\[ E[X] = \sum_y E[X|Y=y] \Pr(Y=y) \] \[ E[X] = \int E[X|Y=y] dP(y) \]
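As a quick illustration (the mixture here is an arbitrary choice): let \(Y\sim\text{Bernoulli}(0.3)\), with \(X|Y=0 \sim \text{Poisson}(1)\) and \(X|Y=1 \sim \text{Poisson}(4)\), so \(E[X] = 1 \times 0.7 + 4 \times 0.3 = 1.9\).

```r
# Law of total expectation, checked by simulation
set.seed(1)
y <- rbinom(1e5, size = 1, prob = 0.3)          # Y ~ Bernoulli(0.3)
x <- rpois(1e5, lambda = ifelse(y == 1, 4, 1))  # X | Y
mean(x)  # close to 1 * 0.7 + 4 * 0.3 = 1.9
```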
Two random variables are independent if their joint probability factorizes into the product of their marginal probabilities, \[ \Pr(X=x,Y=y) = \Pr(X=x) \Pr(Y=y) \] e.g. independent coin flips, die rolls, etc.
This is the classic coin-flipping distribution.
The random variable \(X\) has Bernoulli\((p)\) distribution if \(\Pr(X=1)=p\) and \(\Pr(X=0)=1-p\).
A Binomial\((n,p)\) random variable is a sum of \(n\) independent Bernoulli\((p)\) variables. The probability mass function is \[ \Pr(X=k) = \binom{n}{k} p^k (1-p)^{n-k} \] where \(0 \le k \le n\). The probability mass function tells a nice story!
ks = 0:20
p = dbinom(ks, size = 10, p = 1/2)
plot(ks, p, pch = 16, bty = "n", main = "Binomial probability mass function")

A Geometric\((p)\) random variable is the number of Bernoulli\((p)\) trials needed to achieve the first success. The probability mass function is
\[ \Pr(X=k) = (1-p)^{k-1} p \] where \(k \ge 1\). The probability mass function tells a nice story!
ks = 1:20
# R's dgeom counts failures before the first success, so shift by one
p = dgeom(ks - 1, prob = 1/3)
plot(ks, p, pch = 16, bty = "n", main = "Geometric probability mass function")

A common probability model for count data is the Poisson distribution: \[ \Pr(X = k) = \frac{(\lambda t)^k e^{-\lambda t}}{k!} \] for \(k \ge 0\). This distribution looks weird, but it has a nice story, which we will see when we talk about Poisson processes.
ks = 0:20
lam = 5
p = dpois(ks, lam)
plot(ks, p, pch = 16, bty = "n", main = "Poisson distribution")

The uniform distribution is usually defined on \([0,1]\). \(X \sim \text{Unif}(0,1)\) means the density is \[ f(x) = 1 \] for \(0\le x \le 1\).
If \(X\) has an exponential distribution with rate \(\lambda\), the density is
\[ f(x) = \lambda e^{-\lambda x} \]
and the CDF is \[ F(x) = 1-e^{-\lambda x} \] for \(x \ge 0\).
Memoryless property: \[ \Pr(X>t+h| X>t) = \frac{\Pr(X>t+h)}{\Pr(X>t)} = \frac{e^{-\lambda(t+h)}}{e^{-\lambda t}} = e^{-\lambda h} = \Pr(X>h) \]
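This identity can be verified with R's built-in exponential CDF (the particular rate and times are arbitrary):

```r
# Memoryless property: Pr(X > t + h | X > t) equals Pr(X > h)
lam <- 2; t <- 1.5; h <- 0.7                       # arbitrary values
lhs <- pexp(t + h, rate = lam, lower.tail = FALSE) /
       pexp(t, rate = lam, lower.tail = FALSE)
rhs <- pexp(h, rate = lam, lower.tail = FALSE)
all.equal(lhs, rhs)  # TRUE
```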
xs = seq(0, 10, by = 0.01)
lam = 2
p = dexp(xs, rate = lam)
plot(xs, p, type = "l", bty = "n", main = "Exponential probability density")

The Normal\((\mu,\sigma^2)\) density is
\[ f(x) = \frac{1}{\sqrt{2\pi} \sigma} \exp\left[ -\frac{1}{2\sigma^2} (x-\mu)^2 \right] \]
Where does this functional form come from? It’s complicated.
xs = seq(-5, 5, by = 0.01)
p = dnorm(xs)
plot(xs, p, type = "l", bty = "n", main = "Normal probability density")

There are lots of other “canonical” probability distributions:
Discrete: Negative binomial, beta binomial, counting distributions
Continuous: Gamma, Lognormal, Laplace, survival/accelerated failure time distributions
We won’t use these in this class, but it’s important to understand that we have lots of options for modeling random quantities. This is partly what the field of statistics is about.
Another important class of stochastic model we will see in this class is a decision tree.
A decision tree represents a sequence of decisions, actions, or states in a probability model. You can think of an individual jumping from left to right along the nodes in the tree. At each node, they follow one of the outgoing links with the given probability. This means their path through the tree is stochastic.
Let’s look at a simple decision tree for the treatment of individuals diagnosed with a certain disease. Following diagnosis, each individual gets either treatment A or B, with certain probabilities. Then either they survive or die by some follow-up time, depending on which treatment they got. We can use the information in this tree to learn about the treatments and optimal courses of action.
Computing “marginal” and conditional probabilities is easy via the law of total probability: \[ \Pr(\text{Alive}|\text{Treatment A}) = 0.7 \] \[ \Pr(\text{Alive}|\text{Treatment B}) = 0.1 \] \[ \Pr(\text{Alive}) = 0.4 \times 0.7 + 0.6 \times 0.1 = 0.34 \]
Here is a different tree for the same problem. Suppose we want to know whether treatment A or B is better.
The total probability of being alive is \[ \Pr(\text{Alive}) = 0.9 \times 0.6 + 0.1 \times 0.9 = 0.63\]
Let’s look at the alive people and see which treatment they got: \[ \Pr(\text{Treatment A}|\text{Alive}) = 0.9 \times 0.6 / 0.63 = 0.86 \] \[ \Pr(\text{Treatment B}|\text{Alive}) = 0.1 \times 0.9 / 0.63 = 0.14 \]
Most of the alive people got Treatment A. Treatment A must be best! Is that true?
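The arithmetic behind this comparison, taken from the tree above, is easy to reproduce:

```r
# Bayes' rule applied to the second tree
pA <- 0.9; pAlive_A <- 0.6   # Pr(Treatment A), Pr(Alive | A)
pB <- 0.1; pAlive_B <- 0.9   # Pr(Treatment B), Pr(Alive | B)
pAlive <- pA * pAlive_A + pB * pAlive_B   # 0.63
pA * pAlive_A / pAlive                    # Pr(A | Alive), about 0.86
pB * pAlive_B / pAlive                    # Pr(B | Alive), about 0.14
```

Note that \(\Pr(\text{Alive}|\text{Treatment B}) = 0.9\) exceeds \(\Pr(\text{Alive}|\text{Treatment A}) = 0.6\): most survivors got Treatment A simply because most people got Treatment A.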
“Statistical models” are stochastic because they contain random variables. Consider the usual Normal/Gaussian linear model \[ y = \alpha + \beta x + \epsilon \] where \(\epsilon \sim N(0,\sigma^2)\). This model is stochastic because \(\epsilon\) is random.
In my opinion, you should think about so-called statistical models as mechanistic: look at the relationships they imply, and evaluate whether they capture the relevant dynamics of the system under study.
n = 100
eps = rnorm(n)
x = runif(n)
y = 0.1 - 2 * x + eps
plot(x, y, bty = "n")
abline(a = 0.1, b = -2)

Let \(T_i\) be the survival time of individual \(i\). A survival model is a model for \(T_i\), and is usually expressed in terms of the hazard of failure, \(\lambda_i(t)\). Formally,
\[ \lambda_i(t) = \lim_{h\to 0} \frac{\Pr(T_i \in (t,t+h] \,|\, T_i>t)}{h} \]
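For example, an Exponential\((\lambda)\) survival time \(T\) has constant hazard, since \(f(t) = \lambda e^{-\lambda t}\) and \(\Pr(T>t) = e^{-\lambda t}\):

\[ \lambda(t) = \frac{f(t)}{\Pr(T>t)} = \frac{\lambda e^{-\lambda t}}{e^{-\lambda t}} = \lambda \]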
Since \(T_i\) is stochastic, this is a stochastic model.
In this course, we will focus on Markov stochastic models in discrete and continuous time. A Markov model is one whose next event depends only on its current state, and not on its prior history.
It is sometimes possible to work with non-Markov stochastic models, but it is often not worthwhile.
For a discrete-time Markov chain with one-step transition matrix \(\mathbf{P}\), the \(t\)-step transition probabilities are \[ P_{ij}(t) = [\mathbf{P}^t]_{ij} \]
We described deterministic systems in terms of their rates earlier. But a stochastic model produces different output each time we run it. How can we describe its rates of change when its path is different every time?
We need to describe the rate of change of its transition probabilities. Consider a continuous-time Markov chain \(X(t)\) on a discrete state space \(S\). First, define the transition rate between states \(i\) and \(j\) in \(S\) as \(\lambda_{ij}\). Then define the transition probabilities
\[ P_{ij}(t) = \Pr(X(t)=j|X(0)=i) \]
for \(i,j\in S\).
To describe the dynamics of the system, we can write a system of differential equations,
\[ \frac{dP_{ij}(t)}{dt} = \sum_{k\neq j} \lambda_{kj} P_{ik}(t) - \lambda_j P_{ij}(t) \]
where \(\lambda_j = \sum_{k\neq j} \lambda_{jk}\). When \(S\) is finite, we can collect the rates into the matrix \(\boldsymbol{\Lambda}\), with off-diagonal entries \(\lambda_{ij}\) and diagonal entries \(-\lambda_i\), and write
\[ \mathbf{P}'(t) = \mathbf{P}(t) \boldsymbol{\Lambda} \]
with initial condition \(\mathbf{P}(0)=\mathbf{I}\). The solution is given by the matrix exponential
\[ \mathbf{P}(t) = \exp[\boldsymbol{\Lambda} t] \]
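As a sketch, we can check the matrix exponential solution against the known closed form for a two-state chain with rates \(a\) (state 1 to 2) and \(b\) (state 2 to 1). The `mat_exp` helper below is written here for illustration; any matrix-exponential routine would do:

```r
# Matrix exponential via Taylor series with scaling and squaring (illustrative)
mat_exp <- function(A, terms = 30, k = 10) {
  B <- A / 2^k
  S <- diag(nrow(A)); term <- diag(nrow(A))
  for (i in 1:terms) { term <- term %*% B / i; S <- S + term }
  for (i in 1:k) S <- S %*% S
  S
}
a <- 1; b <- 2; t <- 0.8                              # arbitrary rates and time
L <- matrix(c(-a, a, b, -b), nrow = 2, byrow = TRUE)  # generator matrix
P <- mat_exp(L * t)
# closed form for the two-state chain: P11(t) = b/(a+b) + a/(a+b) exp(-(a+b) t)
P11 <- b/(a + b) + a/(a + b) * exp(-(a + b) * t)
all.equal(P[1, 1], P11)  # TRUE; rows of P also sum to 1
```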
Several properties of CTMCs are useful:
Drawbacks:
Now suppose, starting at time 0, we keep track of the number \(N_t\) of recurrent events that happen before time \(t\). This is called a counting process. Suppose each event \(i\) occurs after a waiting time
\[ W_i \sim \text{Exponential}(\lambda) \]
following the last event. Then \(N_t\) is a continuous-time Markov counting process.
What is the distribution of the number of events \(N_t\) that occur before \(t\)?
\[ N_t \sim \text{Poisson}(\lambda t) \]
So \(E[N_t] = \lambda t\). This is where the Poisson distribution comes from.
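We can see this empirically by simulating exponential waiting times and counting events (the rate and time horizon are arbitrary choices):

```r
# Count events of a rate-lam Poisson process occurring before time t
set.seed(2)
lam <- 3; t <- 2
count_events <- function() {
  total <- 0; n <- 0
  repeat {
    total <- total + rexp(1, rate = lam)  # next exponential waiting time
    if (total > t) return(n)
    n <- n + 1
  }
}
N <- replicate(1e4, count_events())
mean(N)  # close to lam * t = 6
var(N)   # also close to 6: for a Poisson, mean = variance
```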
Now consider a counting process whose waiting times follow
\[ W_i \sim \text{Exponential}(\lambda_i) \]
Many count-dependent counting processes are of interest. These specify a functional form for \(\lambda_i\) as a function of \(i\). Let’s look at some.
Let \(X(t)\) be the number of individuals in a population, and let \(X(0)=x_0>0\) be the initial number. Suppose that when \(X(t)=k\), the population gains another individual with rate
\[ \lambda_k = k\lambda \]
The expected number of individuals at time \(t\) is \[ E[X(t)] = x_0 e^{\lambda t} \] Exponential growth!
Consider a counting process \(X(t)\) and suppose the population has \(X(0)=x_0\) individuals initially. When \(X(t)=k\), the population loses one with rate \[ \mu_k = k\mu \] The expected number of individuals at time \(t\) is \[ E[X(t)] = x_0 e^{-\mu t} \] Exponential decay!
Consider a counting process \(X(t)\) with \(X(0)=x_0\). When \(X(t)=k\), births happen with rate \(\lambda_k = k\lambda\) and deaths with rate \(\mu_k = k\mu\).
The expected number of individuals at time \(t\) is
\[ E[X(t)] = x_0 e^{(\lambda-\mu) t} \]
So we see exponential growth or decay, depending on the relationship of \(\lambda\) and \(\mu\).
Consider a counting process \(X(t)\). When \(X(t)=k\), births happen with rate \(k\lambda\) and immigrations with constant rate \(\alpha\). Deaths happen with rate \(k\mu\) and emigrations with rate \(\delta\).
\[\lambda_k = \alpha + k\lambda \] \[\mu_k = \delta + k\mu \]
The expectations for this model are complicated.
A useful class of self-exciting process is given by the Hawkes process:
\[ \lambda(t) = \alpha(t) + \sum_{i=1}^{N_t} M_i e^{-\gamma (t-t_i)} \]
The rate of new events is a function of the events that have already occurred. Each new event at time \(t_i\) increases the rate of new events by an amount \(M_i\), which then decays exponentially at rate \(\gamma\).
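A minimal sketch of evaluating this intensity, assuming a constant baseline \(\alpha(t)=\alpha\) (the helper function and all parameter values here are illustrative, not from a library):

```r
# Hawkes intensity at time t, given past event times and marks (illustrative)
hawkes_intensity <- function(t, event_times, marks, alpha = 0.5, gamma = 1) {
  past <- event_times < t
  alpha + sum(marks[past] * exp(-gamma * (t - event_times[past])))
}
events <- c(1.0, 1.4, 3.2)
marks  <- c(0.8, 0.8, 0.8)
hawkes_intensity(0.5, events, marks)  # baseline only: 0.5
hawkes_intensity(1.5, events, marks)  # elevated after the first two events
```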
A queue is a model for a list of individuals waiting to be served. Examples:
Let \(X(t)\) be the number of individuals waiting. In the most basic queue model, arrivals happen as a Poisson process with rate \(\lambda\). Individuals are served, and exit the queue, with rate \(\mu\).
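A hand-rolled Gillespie-style simulation of this basic queue is a useful sketch (parameter values are arbitrary; for \(\lambda < \mu\), the long-run average queue length approaches \(\lambda/(\mu-\lambda)\)):

```r
# Simulate the basic queue and report the time-averaged queue length
set.seed(3)
sim_queue <- function(lam = 0.5, mu = 1, tf = 2000) {
  t <- 0; x <- 0; xs <- c(); dts <- c()
  while (t < tf) {
    rate <- lam + if (x > 0) mu else 0   # total event rate
    dt <- rexp(1, rate)                  # waiting time to next event
    xs <- c(xs, x); dts <- c(dts, dt)
    t <- t + dt
    # arrival with probability lam/rate, otherwise a service completion
    if (runif(1) < lam / rate) x <- x + 1 else x <- x - 1
  }
  sum(xs * dts) / sum(dts)               # time-averaged queue length
}
sim_queue()  # compare with lam/(mu - lam) = 1
```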
Gonsalves et al. (2017)
The “Gillespie algorithm” performs forward simulation of a stochastic compartmental model (Pineda-Krch 2008). You don’t really need to know how it works; the key point is that forward simulation of continuous-time Markov models is computationally straightforward.
The ssa function in the GillespieSSA package implements a fast version of the Gillespie algorithm.
library(GillespieSSA)
parms <- c(lam=1)
x0 <- c(X=1)
a <- c("lam*X")
nu <- matrix(1)
out <- ssa(x0,a,nu,parms,tf=10,simName="Pure birth")
plot(out$data[,1], out$data[,2], col="red", cex=0.5, pch=19, xlab="Time", ylab="Number")

ssa(x0,    # initial state vector
    a,     # propensity vector
    nu,    # state-change matrix
    parms, # model parameters
    tf     # final time
)

head(out$data)
##    timeSeries X
##     0.0000000 1
##     0.1668619 2
##     0.5045949 3
##     0.5359663 4
##     0.8149916 5
##     0.8594286 6
parms <- c(b=2, d=1, K=1000)
x0 <- c(N=500)
a <- c("b*N", "(d+(b-d)*N/K)*N")
nu <- matrix(c(+1,-1),ncol=2)
out <- ssa(x0,a,nu,parms,tf=10,simName="Logistic growth")
plot(out$data[,1], out$data[,2], type="l", xlab="Time", ylab="Number")
abline(h=1000, lty="dashed", col="gray")

parms <- c(beta=0.001, gamma=0.1)
x0 <- c(S=499,I=1,R=0)
a <- c("beta*S*I","gamma*I")
nu <- matrix(c(-1,0,+1,-1,0,+1),nrow=3,byrow=TRUE)
out <- ssa(x0,a,nu,parms,tf=100,simName="SIR model")
#ssa.plot(out)
plot(out$data[,1], out$data[,2], col="green", ylim=c(0,500), type="l", xlab="Time", ylab="Number")
lines(out$data[,1], out$data[,3], col="red")
lines(out$data[,1], out$data[,4], col="blue")
legend(50, 500, c("S", "I", "R"), pch=16, col=c("green", "red", "blue"))

Gonsalves, Gregg S, A David Paltiel, Paul D Cleary, Michael J Gill, Mari M Kitahata, Peter F Rebeiro, Michael J Silverberg, et al. 2017. “A Flow-Based Model of the HIV Care Continuum in the United States.” Journal of Acquired Immune Deficiency Syndromes 75 (5): 548–53.
Pineda-Krch, Mario. 2008. “GillespieSSA: Implementing the Stochastic Simulation Algorithm in R.” Journal of Statistical Software 25 (12): 1–18.